re-equating were explored using data from a spring administration of the SAT
for 9,517 test takers in 10 subgroups. By using a "dissection" approach to
reference and focal group formations, this two-way classification scheme may
yield new and detailed insight into item functioning at the subgroup level.
Two hypotheses were studied: (1) whether or not the deletion of sizable DIF
items disadvantageous to a particular subgroup will affect that subgroup the
most; and (2) whether or not the effects of item deletion on scores can be
predicted by the standardization method. Both hypotheses were supported by
the results of this research. Scaled score differences following item
deletion and re-equating varied among subgroups, depending on the DIF
effects. Subgroups disadvantaged by the subsequently deleted sizable DIF
items gained scaled score points whereas advantaged groups lost points. Regression
analyses confirmed the second hypothesis. It was also shown that by deleting
an item with sizable negative DIF, the focal group might be greatly
benefited. Among three item deletion scenarios, DIF effects yielded from the
two-way classification scheme showed very little interaction in the majority

Abstract

Statistical procedures for detecting differential item functioning (DIF) are often used to
screen items for construct-irrelevant variance. Standard DIF detection procedures focus on only
one categorical variable at an aggregated group or one-way level, such as gender or ethnicity/race.
Building on previous work by Hu and Dorans (1989), Dorans and Holland (1993), and Zhang
(2001), this research applies a DIF dissection classification scheme to SAT I: Verbal data.
Subsequently, the effect of deleting sizable DIF items on reported scores after equipercentile
re-equating was explored. By using a “Dissection” approach to reference and focal group
formations, this two-way classification scheme may yield new and detailed insight into item
functioning at the subgroup level. Two hypotheses were studied: (1) whether or not the deletion
of sizable DIF items disadvantageous to a particular subgroup will affect that subgroup the most
and (2) whether or not the effects of item deletion on scores can be predicted by the
standardization method. Both hypotheses were supported by the results of this research. Scaled
score differences following item deletion and re-equating varied among subgroups, depending on
the DIF effects. Subgroups disadvantaged by the subsequently deleted sizable DIF items gained
scaled score points, whereas advantaged subgroups lost points. Regression analyses confirmed the second
hypothesis. It was also shown that by deleting an item with sizable negative DIF, the focal group
might be greatly benefited. Among three item deletion scenarios, DIF effects yielded from the
two-way classification scheme showed very little interaction in the majority of cases.

Background

Standardized achievement tests often have high stakes attached to their use. Statistical 
procedures for detecting differential item functioning (DIF) are frequently used to screen items 
for construct irrelevant variance. Standard DIF detection procedures focus on only one 
categorical variable at an aggregated group level, such as gender or ethnicity/race. To date, DIF 
studies in the arena of standardized achievement testing have investigated gender separately from 
ethnicity/race (e.g., Calton & Harris, 1992; Doolittle & Cleary, 1987; O’Neil & McPeek, 1993; 
Scheuneman & Grima, 1997; and Schmitt & Dorans, 1990). 

Hu and Dorans (1989) used data from the SAT I: Verbal test to examine the effect of deleting
both minimal and sizable DIF items on equating functions and subsequent reported scores. The
hypothesis they tested was whether or not the deletion of minimal and/or sizable DIF items resulted in
different scaled scores after IRT true score re-equating and Tucker re-equating. The results of that
study indicated that although deleting certain items affected scaled scores in general, the act of deleting
an item had a larger effect on scaled scores than did the extent of DIF of the deleted items.

Dorans and Holland (1993) pointed out that in traditional one-way DIF analysis, deleting 
items due to DIF can have unintended consequences on the focal group. DIF analysis performed on 
gender and on ethnicity/race alone ignores the potential interactions between the two main effects. 
Additionally, Dorans and Holland suggested applying a “Melting Pot” or “Dissection” DIF method
wherein the total group would function as the reference group and each gender-by-ethnic subgroup 
would serve sequentially as a focal group. 

Zhang (2001) argued that DIF analysis with a traditional one-way approach does not serve the
purpose of illuminating actual gender and ethnic/racial performance differences. A two-way DIF
classification scheme was proposed, in which each item was examined for DIF effect at the subgroup
level, i.e., gender DIF within ethnicity/race and ethnicity/race DIF within gender. The results of that
study identified several gender and ethnic/racial DIF items which were previously undetected in a
total analysis and yet were flagged when two-way procedures were applied.

Research Questions 

Building on previous work by Dorans and Holland (1993), Hu and Dorans (1989), and Zhang
(2001), this research applies a two-way DIF classification scheme to SAT® I: Reasoning Test (SAT)
Verbal data. Subsequently, the effect of deleting sizable DIF items on reported scores after
equipercentile re-equating was examined. As mentioned earlier, this two-way classification scheme
utilizes non-traditional reference and focal group formations. For purposes of this research, this
approach will be referred to as “DIF dissection.” In DIF dissection, each subgroup acts as an
independent focal group while the total group functions as the reference group. In essence, the
total group is dissected into a set of complementary focal groups. It is believed that using this
approach to reference and focal group formation may provide detailed information about item
performance at the subgroup level.

There were three goals to this research: (1) to examine items for DIF using the
above-described DIF dissection classification scheme within the standardization DIF detection
procedure, (2) to assess the effect, if any, of deleting sizable DIF items from all groups on the
reported score after re-equating the shortened tests, and (3) to make recommendations regarding
future routine DIF detection procedures.

The hypotheses to be tested are the following:

1. The deletion of DIF items disadvantageous to a particular subgroup will affect that 
subgroup the most; 

2. The effects of item deletion on scores will be predicted by the standardization method. 

All items of a particular SAT I: Verbal pretest that were flagged for sizable levels of DIF
during standard operational analyses were removed from the response vectors of the affected group as
well as from those of all other groups. Sizable DIF is defined according to the ETS delta criteria, which are
elaborated upon later in this report. Reported score distributions and score changes for
each ethnic and gender group were then examined after the systematic deletion of each item. The
standardization method (Dorans & Kulick, 1986) was chosen for this work because it is easily adapted
to formula-scored tests as well as to multi-group analyses. It also lends itself well to the
prediction of the effects of item deletion on subgroup performance.

Method 



Data Source 

Data were obtained from a Spring 2001 administration of the SAT. All test editions 
consisted of 78 five-option multiple-choice verbal items. In addition to these operational items, 
each test contained a 30-minute, non-operational section that was used for equating purposes as 
well as for pretesting new items. This research is limited to the use of the 35-item, five-option
multiple-choice verbal pretest sections. Instructions to test takers directed them to choose the best of the
five provided options for each item.

For this research, examinees were classified by both gender and ethnicity/race.
Following the subgroup classification scheme used by Dorans and Holland (2000), we placed all
examinees who indicated their gender but not their ethnicity/race in a group labeled “All
Others.” In addition, Native Americans were also placed in “All Others” since this particular
sample was too small to withstand subgroup-level analyses.

A total of ten subgroups were formed: African American Females, African American
Males, Asian Females, Asian Males, Hispanic Females, Hispanic Males, White Females, White
Males, All Other Females, and All Other Males. For purposes of DIF
analyses, the reference group was defined to be the total group; the focal groups were formed
according to each of the 10 subgroups (see Table 1).



Table 1

Composition of Reference Group and Focal Groups

Reference Group   Focal Groups
                  Female                    Male
Total Group       African American Female   African American Male
                  Asian Female              Asian Male
                  Hispanic Female           Hispanic Male
                  White Female              White Male
                  All Other Female          All Other Male

Formula Scoring Procedures

The scoring procedure for the SAT I utilizes a formula scoring (FS) procedure and is
defined as follows:

$$FS = \text{Rights} \times 1 + (\text{Omits and Not Reached}) \times 0 + \left(-\frac{1}{k-1}\right) \times \text{Wrongs},$$

where k = number of options for each multiple-choice item. As can be seen, omitted and not
reached (NR) items are treated differently than are incorrect responses. Whereas points are
neither awarded nor deducted for omitted/not reached items, incorrect responses to
multiple-choice items result in the loss of a fraction of a point. In this case (k = 5), each incorrect response results in a
0.25 deduction from the total FS score.
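
The scoring rule above can be illustrated with a minimal sketch. The function name `formula_score` and its argument names are hypothetical and chosen only for this illustration; omitted and not-reached items contribute zero and so do not appear in the computation.

```python
def formula_score(rights, wrongs, k=5):
    """Formula score: +1 per right answer, -1/(k-1) per wrong answer.

    Omitted and not-reached items carry zero weight, so they are not
    passed in at all. k is the number of options per item (5 here).
    """
    return rights * 1.0 - wrongs / (k - 1)


# Hypothetical examinee on a 78-item test: 50 right, 20 wrong, 8 omitted.
example_score = formula_score(rights=50, wrongs=20)
```

With k = 5, the 20 wrong answers in this hypothetical case cost 20 × 0.25 = 5 points.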

DIF Detection Procedure — The Standardization Method 

The standardization method (STD) for DIF detection (Dorans & Kulick, 1986;
Dorans & Schmitt, 1993) was used in this study. As stated by Dorans and Holland (1993), the
standardization method is readily adapted to formula-scored items, such as those used on the
SAT-Verbal.

The standardization definition of DIF at the individual score level, m, is given by

$$D_m = FS_{fm} - FS_{rm},$$

where $FS_{fm}$ and $FS_{rm}$ are the item-test regressions of the focal and reference groups at
score level m. For formula-scored items, STD has a DIF index defined by the standardized formula
score difference (STD FS-DIF), given by

$$\text{STD FS-DIF} = \frac{\sum_m K_m \,(FS_{fm} - FS_{rm})}{\sum_m K_m},$$

where $K_m / \sum_m K_m$ is the weighting factor at score level m. The weights are supplied by the
standardization group and are used to weight differences in item performance between the focal
group, $FS_{fm}$, and the reference group, $FS_{rm}$.

Since the SAT is a formula-scored test, the formula-scored DIF index given by the standardization
method, STD FS-DIF, is used for DIF evaluation in this study. Using a formula-scored
DIF procedure for a formula-scored test provides consistent conditions under which the item is
analyzed. Rather than scoring an item as 1 if correct and 0 if incorrect, omitted, or NR, STD FS-DIF
incorporates the formula scoring algorithm: it assigns zero weight to omitted and NR items and
$-\frac{1}{k-1}$ to incorrect responses, where k is the number of choice options. The STD FS-DIF index
ranges from -1.25 to +1.25, inclusive, in this case where k = 5.
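
As a rough sketch of how a standardization index of this kind can be computed, the following takes the mean item formula score of the focal and reference groups at each matching score level, together with a weight per level, and returns the weighted mean difference. The function name `std_fs_dif` and the toy numbers are hypothetical. Using the focal group counts as weights is a common choice in standardization analyses, but it is an assumption here, not something the text specifies.

```python
import numpy as np


def std_fs_dif(fs_focal, fs_ref, weights):
    """Weighted mean of focal-minus-reference item formula scores.

    fs_focal, fs_ref: mean item formula score at each matching score level.
    weights: weight per score level from the standardization group
             (assumed here to be, e.g., focal group counts).
    """
    fs_focal = np.asarray(fs_focal, dtype=float)
    fs_ref = np.asarray(fs_ref, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (fs_focal - fs_ref)) / np.sum(weights))


# Toy example with two score levels (illustrative values only).
toy_index = std_fs_dif(fs_focal=[0.2, 0.5], fs_ref=[0.3, 0.7], weights=[1, 3])
```

A negative result, as in the toy example, means the focal group scored lower on the item than matched members of the reference group.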



ETS Classification Criteria 

Educational Testing Service (ETS) relies on a DIF statistic that expresses differences on a
delta scale as a measure of magnitude of effect. In order to compute this statistic, the
Mantel-Haenszel common odds ratio, αMH, must first be computed. After αMH is derived, it is placed on
a delta scale via the following logarithmic transformation (Holland & Thayer, 1988):
deltaMH = -2.35 ln(αMH). DeltaMH can be interpreted as the average amount by which a member of the reference group
found the studied item to be more difficult than did a comparable member of the focal
group, or vice versa. A value of zero suggests no DIF is present. Similar to the STD FS-DIF
index, a negative DIF value suggests that the focal group is disadvantaged and the reference
group is advantaged, while a positive DIF value indicates that the focal group is advantaged and the
reference group is disadvantaged.
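
A minimal sketch of the computation follows, using the standard Mantel-Haenszel estimator of the common odds ratio over matched score levels. The function name and the layout of the input tuples are assumptions made for this illustration only.

```python
import math


def mantel_haenszel_delta(tables):
    """Return (alpha_MH, delta_MH) from 2x2 tables at each matched score level.

    tables: iterable of (ref_right, ref_wrong, focal_right, focal_wrong)
            counts, one tuple per level of the matching variable.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in tables:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # empty score level contributes nothing
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    alpha = numerator / denominator
    delta = -2.35 * math.log(alpha)
    return alpha, delta


# Toy data: one score level where the reference group has higher odds of success.
alpha_toy, delta_toy = mantel_haenszel_delta([(20, 10, 10, 20)])
```

In the toy data the reference group has four times the odds of answering correctly, so the resulting delta is negative, which matches the sign convention described above.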

Dorans and Holland (1993) described the ETS DIF classification scheme for use in test
development. According to the scheme, all items can be categorized into one of the following
three non-overlapping groups:

1) Negligible DIF (A-level), which refers to items for which either the magnitude of deltaMH
is < 1 delta unit in absolute value or deltaMH is not statistically
significantly different from 0;

2) Large DIF (C-level), which refers to items for which deltaMH is > 1.5 delta units in absolute value
and is statistically significantly > 1.0 in absolute value; and

3) Medium DIF (B-level), which refers to all other items.
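
The three categories can be expressed as a small decision function. This is only a sketch of the rules as stated above: the function name is hypothetical, and the two significance tests are assumed to have been carried out elsewhere and are passed in as booleans.

```python
def ets_dif_category(delta_mh, sig_diff_from_zero, sig_greater_than_one):
    """Classify an item as 'A', 'B', or 'C' from its deltaMH value.

    sig_diff_from_zero:   True if deltaMH differs significantly from 0.
    sig_greater_than_one: True if |deltaMH| is significantly greater than 1.0.
    Both flags are assumed to come from separate significance tests.
    """
    size = abs(delta_mh)
    if size < 1.0 or not sig_diff_from_zero:
        return "A"  # negligible DIF
    if size > 1.5 and sig_greater_than_one:
        return "C"  # large (sizable) DIF
    return "B"      # medium DIF: everything else


category_example = ets_dif_category(-1.8, True, True)
```

Note that an item with a large deltaMH that fails the significance tests does not reach C-level under these rules.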

Equipercentile Equating 

An equipercentile equating method was used for the equating in this study. By definition, 
two scores from two different forms of one test may be considered equivalent to one another if 
their corresponding percentile ranks in any given group are equal (Kolen, 1988). The relative 
cumulative frequency distribution for each form is computed and plotted. Examinees' scores are
then matched on equal percentile ranks. Both the single group design and the equipercentile
equating method are very straightforward. Smoothing was not needed since the sample size in
this research was sufficiently large.
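
The idea of matching scores on equal percentile ranks can be sketched as follows. This is an unsmoothed illustration under simplifying assumptions (mid-percentile ranks, linear interpolation between observed scores), not the operational equating procedure; the function names and toy data are hypothetical.

```python
import numpy as np


def percentile_ranks(scores, grid):
    """Mid-percentile rank (0-100) of each grid score within the sample."""
    scores = np.asarray(scores, dtype=float)
    below = np.array([(scores < g).mean() for g in grid])
    at = np.array([(scores == g).mean() for g in grid])
    return 100.0 * (below + 0.5 * at)


def equipercentile_equate(x_scores, y_scores):
    """Map each observed Form X score to the Form Y score with the same percentile rank."""
    x_grid = np.unique(x_scores)  # observed score points on each form
    y_grid = np.unique(y_scores)
    pr_x = percentile_ranks(x_scores, x_grid)
    pr_y = percentile_ranks(y_scores, y_grid)
    # Interpolate Form Y scores at the percentile ranks of the Form X scores.
    return x_grid, np.interp(pr_x, pr_y, y_grid)


# Toy single-group data: Form Y scores are exactly twice the Form X scores.
x_points, y_equivalents = equipercentile_equate([0, 1, 2, 3], [0, 2, 4, 6])
```

In the toy data each Form X score is mapped to twice its value, since the two score distributions have identical percentile ranks at corresponding points.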



Results 

DIF summary statistics were reviewed for all verbal and mathematics pretest forms from 
a single administration of the SAT. These summary statistics were reviewed in terms of the 
number of items with sizable DIF and the degree of DIF effects. Specific verbal sections were 
chosen for further screening if items with more sizable DIF were flagged. Of the different 
pretests, only one was retained for this research because it had six C-level (sizable) DIF items. It 
should be emphasized that none of these items was ever administered as an operational item on 
any SAT. 

For DIF analyses, the matching variable used to compute deltaMH was the operational
score resulting from the 78-item verbal test. For the sake of simplicity, this test form will be
referred to as Form-X for the remainder of this paper. Again, it should be stated that the
operational form of the SAT was DIF-free since no C-level items are ever used on operational
test forms. In total, there were 35 pretest items and 78 operational items on Form-X.

Table 2 displays the number of examinees and percentages of subgroups out of the grand total.

Table 2

Number of Examinees and Percentage of Total in the Data Sample

         African American   Asian         Hispanic      White            All Others    Total
Female   437 (4.59%)        299 (3.14%)   313 (3.29%)   3,799 (39.92%)   356 (3.74%)   5,204 (54.68%)
Male     345 (3.63%)        240 (2.52%)   229 (2.41%)   3,185 (33.47%)   314 (3.30%)   4,313 (45.32%)
Total    782 (8.22%)        539 (5.66%)   542 (5.70%)   6,984 (73.38%)   670 (7.04%)   9,517 (100%)



Effects of Deleting Items with C-Level DIF on Scaled Scores 

Subgroup DIF analysis was performed on all items in the studied pretest using the 
operational score as the matching variable. The resulting DIF statistics provided information 
regarding which items exhibited sizable (C-level) DIF. Responses from these flagged items were 
then deleted from the computed raw scores. Three C-level DIF items, Item #1, Item #11, and 
Item #16 were selected for systematic item deletion. In total, there were three rounds of single 
item deletion and one instance of removing all three items at once. 

Dorans (1986) investigated the effects of item deletion on equating/scaling functions and 
reported scaled score distributions. He concluded that re-equating is psychometrically desirable after 
an item is deleted. In this research, equipercentile equating was used to equate the full pretest (35 
items) to the operational test (78 items). Then the shortened tests (34 items after a single-item deletion,
32 items after deleting all three items) were also
equated to the operational test (78 items) using equipercentile equating. No smoothing was needed
since the sample was sufficiently large (n=9,517). A standard formula scoring procedure was used as
discussed earlier in this paper. The distributions of the raw scores and scaled scores on a 20-80 scale
were obtained for each subgroup and the total group. For this specific study, the scaled scores were
expressed on a 20-to-80 scale instead of the 200-to-800 scale so that the observed scaled score
differences (SSDs) could be viewed in perspective.

Re-equating using the equipercentile method was then performed four times: once on each of the three
shortened 34-item tests and once on the 32-item test (after removing Items #1, #11, and #16 together).
The resulting scaled scores were then compared between the full test and the shortened test forms.



Table 3

Numbers and Percentages of Males and Females within Each Subgroup

               African American   Asian          Hispanic       White            All Others     Row Total
Female         437 (55.88%)       299 (55.47%)   313 (57.75%)   3,799 (54.40%)   356 (53.13%)   5,204 (54.68%)
Male           345 (44.12%)       240 (44.53%)   229 (42.25%)   3,185 (45.60%)   314 (46.87%)   4,313 (45.32%)
Column Total   782 (100%)         539 (100%)     542 (100%)     6,984 (100%)     670 (100%)     9,517 (100%)

Sample sizes and percentages for each subgroup within its total group can be found in Table 3. The
one-way STD FS-DIF values and the two-way STD FS-DIF values for Items #1, #11, and #16 can be found
in subsequent tables. The one-way STD FS-DIF values were derived from the traditional DIF
analysis using males and Whites as the reference groups. In contrast, the dissection STD FS-DIF
values resulted from the two-way DIF method using the total group as the reference group.
Unrounded scaled score differences (SSDs) after removing each item are displayed as well.



Table 4

One-way STD FS-DIF Values for Item #1

Reference/Focal Group     STD FS-DIF
Male/Female               -0.288
White/African American    -0.140
White/Asian               -0.090
White/Hispanic            -0.087



As seen in Table 4, a one-way DIF procedure resulted in a STD FS-DIF index of -0.288 
for Item #1 (using females as focal group). The negative sign of this index indicates that the 
reference group (males) outperformed the focal group (females), suggesting that this item 
disadvantaged the female group. 

In Table 5 below, the two-way STD FS-DIF indices show that, among male
subgroups, White males benefited most from Item #1 (STD FS-DIF = 0.181). Among the female
subgroups, African American females (STD FS-DIF = -0.202) and Asian females (STD FS-DIF
= -0.192) were most adversely affected, though the other female subgroups (STD FS-DIF from
-0.142 to -0.112) were negatively affected as well.



Table 5

The Two-way STD FS-DIF Values for Item #1

                   African American   Asian    Hispanic   White    All Others   Total
Female             -0.202             -0.192   -0.142     -0.112   -0.132       -0.127
Male                0.044              0.099    0.061      0.181    0.112        0.154
F - M Difference   -0.246             -0.291   -0.203     -0.293   -0.244       -0.281
Total              -0.093             -0.062   -0.056      0.021   -0.017        -



Row three in Table 5 displays the difference between the female and male two-way STD
FS-DIF values for each ethnic group. The values are not significantly different from each other,
ranging from -0.203 to -0.293, thus showing little gender-by-ethnicity interaction.

Table 6

Unrounded Scaled Score Differences after Removing Item #1

         African American   Asian    Hispanic   White    All Others   Total
Female    0.327              0.328    0.195      0.206    0.211        0.223
Male     -0.015             -0.163    0.066     -0.198   -0.096       -0.160
Total     0.176              0.110    0.140      0.022    0.067        0.049

The STD FS-DIF effect on the scaled score differences (SSDs) after dropping Item #1
can be seen in Table 6. On average, scaled scores (scale range 20-80) for all male subgroups
were reduced, except for the Hispanic male group. The White male group lost 0.198 points. In
contrast, on average, each of the five female groups gained at least 0.195 points. For Item #1,
the groups most seriously affected by the DIF were the African American female and
Asian female subgroups. On average, they gained the most when Item #1 was removed: 0.327 and
0.328 points, respectively.



Table 7

One-way STD FS-DIF Values for Item #11

Reference/Focal Group     STD FS-DIF
Male/Female                0.012
White/African American    -0.246
White/Asian               -0.165
White/Hispanic            -0.208



In Table 7, the one-way STD FS-DIF for the male/female comparison was 0.012 (A-level
DIF). In Table 8, the output resulting from the two-way scheme indicates
that Item #11 displays a DIF effect between the White group and each individual ethnic group. The
one-way STD FS-DIF values were negative for all ethnic groups: -0.246 for African Americans,
-0.165 for Asians, and -0.208 for Hispanics. Item #11 gave a slight advantage to the White
group over the individual ethnic groups; the two-way STD FS-DIF values (Table 8) for the White male
and female groups were 0.037 and 0.042, respectively. It should be noted that African American
females, Asian males, and Hispanic males were more seriously affected than the remaining
subgroups.



Table 8

The Two-way STD FS-DIF Values for Item #11

                   African American   Asian    Hispanic   White    All Others   Total
Female             -0.166             -0.066   -0.124      0.042   -0.014        0.005
Male               -0.145             -0.176   -0.168      0.037   -0.035       -0.006
F - M Difference   -0.021              0.110    0.044      0.005    0.021        0.011
Total              -0.157             -0.115   -0.142      0.040   -0.024        -



In Table 8, the female-male difference for the Asian group is 0.11, while the differences for the other groups
were much smaller, at most 0.044 in absolute value. The DIF effect for Asian males was more than twice
that for Asian females (-0.176 vs. -0.066), thus showing a gender-by-ethnicity
interaction.






Table 9

Unrounded Scaled Score Differences after Removing Item #11

         African American   Asian    Hispanic   White    All Others   Total
Female    0.121              0.017    0.128     -0.110   -0.037       -0.065
Male      0.203              0.288    0.205     -0.083    0.064       -0.013
Total     0.151              0.137    0.161     -0.098    0.011       -0.042



The SSDs after dropping Item #11 are indicated in Table 9. On average, the scaled score
for the White group, as a whole, decreased 0.098 points while the African American, Asian, and
Hispanic groups gained 0.151, 0.137, and 0.161 points, respectively. By inspecting subgroups, it
can be seen that Asian males gained the most, 0.288 points on average, followed by 0.205 points
for Hispanic males and 0.203 points for African American males. Asian males were
also the most disadvantaged subgroup, as indicated by the two-way STD FS-DIF values seen in
Table 8.



Table 10

One-way STD FS-DIF Values for Item #16

Reference/Focal Group     STD FS-DIF
Male/Female               -0.193
White/African American    -0.088
White/Asian                0.059
White/Hispanic            -0.008

As indicated in Table 10, the one-way DIF analysis revealed that Item #16 was
another gender DIF item (STD FS-DIF = -0.193). Again, the results obtained by the two-way
approach (Table 11) offer clarification at the subgroup level. Values in Table 11 indicate that
African American females were the most disadvantaged of the female subgroups (STD FS-DIF
= -0.146). All male subgroups yielded positive STD FS-DIF values.

Table 11

The Two-way STD FS-DIF Values for Item #16

                   African American   Asian    Hispanic   White    All Others   Total
Female             -0.146             -0.012   -0.077     -0.086   -0.076       -0.086
Male                0.011              0.156    0.099      0.106    0.146        0.103
F - M Difference   -0.157             -0.168   -0.176     -0.192   -0.222       -0.189
Total              -0.077              0.063   -0.003      0.001    0.028        -



Table 11 indicates that, for Item #16, the differences between the female and male DIF effects are
similar across ethnic groups (from -0.157 to -0.222), showing little gender-by-ethnicity interaction.

Table 12

Unrounded Scaled Score Differences after Removing Item #16

         African American   Asian    Hispanic   White    All Others   Total
Female    0.144             -0.060    0.032      0.131    0.082        0.112
Male      0.012             -0.221   -0.135     -0.113   -0.166       -0.114
Total     0.086             -0.139   -0.039      0.020   -0.034        0.009



As seen in Table 12, after removing Item #16, African American females, on average,
gained the most points (0.144); note also that they were the most disadvantaged group shown in
Table 11. The Asian male group, on the other hand, lost 0.221 points on average, followed by
All Other males (0.166), Hispanic males (0.135), and White males (0.113).



Table 13

Unrounded Scaled Score Differences after Removing Items #1, #11, and #16

         African American   Asian    Hispanic   White    All Others   Total
Female    0.789              0.355    0.482      0.173    0.244        0.259
Male      0.354             -0.150    0.101     -0.517   -0.296       -0.378
Total     0.597              0.130    0.321     -0.142   -0.009       -0.030

Table 13 summarizes the SSDs between the full pretest (35 items) and the shortened test
(32 items) resulting from dropping Item #1, Item #11, and Item #16. Between male and female
groups, males lost an average of 0.378 scaled points and females gained, on average, 0.259
points after dropping all three items. Among the intact ethnic groups, the average score increase
for African Americans was 0.597 points, followed by an increase of 0.321 points for the
Hispanic group. Within subgroups, White males lost an average of 0.517 points while scaled
scores for African American females increased by an average of 0.789 points. Hispanic females
also gained an average of 0.482 points from the deletion of this set of items.

Obtaining the One-way DIF Using Subgroup Two-way DIF 

Two-way DIF indices for males and females within each subgroup can be used to
derive the one-way DIF indices for each gender and ethnic group. When two subpopulations are
of equal size, the total population (TP) average, denoted AveTP, is simply the equally weighted sum
of the two subpopulation (SP) averages, referred to as AveSP1 and AveSP2, respectively:

AveTP = 0.5*(AveSP1 + AveSP2) (1)

The difference between AveSP1 and AveTP is then

AveSP1 - AveTP = AveSP1 - 0.5*(AveSP1 + AveSP2)

=> AveSP1 - AveTP = 0.5*(AveSP1 - AveSP2).

By the same reasoning,

AveSP2 - AveTP = AveSP2 - 0.5*(AveSP1 + AveSP2)

=> AveSP2 - AveTP = 0.5*(AveSP2 - AveSP1).

Generally, however, subpopulations have unequal sample sizes. When two
subpopulations are of unequal sizes, the total population average, AveTP, is the weighted sum of
the two subpopulation averages, AveSP1 and AveSP2, in which the weights sum
to 1.

Hence, AveTP = (w1*AveSP1 + w2*AveSP2), (2)

where weights w1 and w2 are the proportions of sample sizes for each subgroup, and w1 + w2 = 1.

In the context of regular one-way DIF analysis, the object of investigation is the
difference between two total group means, for example, those of the male and female groups, which is
equivalent to (AveSP1 - AveSP2). In the context of two-way DIF, however, the difference
between each subgroup mean and the total group mean is examined, which is equivalent to
(AveSP1 - AveTP).

Therefore, AveSP1 - AveTP = AveSP1 - (w1*AveSP1 + w2*AveSP2)

=> AveSP1 - AveTP = w2*(AveSP1 - AveSP2) (3)

and

AveSP2 - AveTP = AveSP2 - (w1*AveSP1 + w2*AveSP2)

=> AveSP2 - AveTP = w1*(AveSP2 - AveSP1) (4)

In the context of this paper, we make the assumption that AveTP, the average DIF in the
total population, equals 0. Thus, equation (3) becomes

AveSP1 = w2*(AveSP1 - AveSP2), (5)

and equation (4) becomes

AveSP2 = w1*(AveSP2 - AveSP1) = -w1*(AveSP1 - AveSP2). (6)

In the context of one-way DIF procedures, the one-way male vs. female DIF is
(AveSP1 - AveSP2). By rearranging equations (5) and (6), this difference can be
expressed as

AveSP1 - AveSP2 = AveSP1/w2 = -AveSP2/w1. (7)

By applying equation (2), the one-way DIF indices for ethnic groups can be obtained by
summing the weighted subgroup DIF indices, where the weights are the
proportions of each gender's sample size within the ethnic group, as shown in Table 3. Consider the following
example for Item #1.

Example 1:

Let SP1 be African American females and SP2 be African American males. Equation (2)
can be expressed as AveTP_AA = (w1*AveSP1_female + w2*AveSP2_male). The weight for subgroup
1 (African American females) is .5588 and the weight for subgroup 2 (African American
males) is .4412; these values can be found in Table 3. In Table 5, the two-way STD FS-DIF
values for AveSP1_female and AveSP2_male are -0.202 and 0.044, respectively. By substituting
these values into equation (2), we derive the following value:

AveTP_AA = (0.5588*(-0.202) + 0.4412*0.044) = (-0.11288) + 0.01941 = -0.09347,

which is the one-way DIF value for the African American group (see Table 5). It can be
shown that one-way DIF values for all other ethnic and gender groups can be derived
accordingly.
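
The same weighted sum can be checked for every ethnic group with a short sketch. The within-group female proportions are taken from Table 3 and the two-way STD FS-DIF values for Item #1 from Table 5; the variable and function names are hypothetical. Because the published inputs are rounded, the results agree with the Total row of Table 5 only to about three decimal places.

```python
# Female share within each ethnic group (Table 3) and two-way STD FS-DIF
# values for Item #1 (Table 5), as (female, male) pairs.
female_share = {"African American": 0.5588, "Asian": 0.5547,
                "Hispanic": 0.5775, "White": 0.5440, "All Others": 0.5313}
two_way_item1 = {"African American": (-0.202, 0.044), "Asian": (-0.192, 0.099),
                 "Hispanic": (-0.142, 0.061), "White": (-0.112, 0.181),
                 "All Others": (-0.132, 0.112)}


def one_way_from_two_way(w_female, dif_female, dif_male):
    """Equation (2): weighted sum of the two subgroup DIF values."""
    return w_female * dif_female + (1.0 - w_female) * dif_male


one_way_item1 = {
    group: one_way_from_two_way(female_share[group], *two_way_item1[group])
    for group in female_share
}
```

For the African American group this reproduces the value -0.09347 derived in Example 1.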

Equation (7) can also be used to obtain one-way gender DIF values on Item #1. Consider
the following example.

Example 2:

Let SP1 be Females and SP2 be Males. AveSP1_female and AveSP2_male can be found from
the two-way DIF values located in Table 5, and the proportions of the male and female groups are located in
Table 3. Equation (7) for this Male/Female DIF is:

AveSP1_female - AveSP2_male = AveSP1_female/w_male = (-0.127)/(0.4532) = -0.280,

which is close to the one-way DIF value (-0.288) for Male/Female DIF (see Table 4). Also note
that

AveSP1_female - AveSP2_male = -AveSP2_male/w_female = -(0.154)/(0.5468) = -0.282

is also close to the one-way DIF value (-0.288) for Male/Female DIF shown in Table 4.
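
The two estimates in Example 2 can be reproduced with a few lines. The function name is hypothetical; the inputs are the total female and male two-way STD FS-DIF values for Item #1 (-0.127 and 0.154) and the female proportion of the total sample (0.5468).

```python
def one_way_gender_dif(two_way_female, two_way_male, w_female):
    """Equation (7): two estimates of (female - male) from two-way DIF values."""
    w_male = 1.0 - w_female
    from_female = two_way_female / w_male   # AveSP1 / w2
    from_male = -two_way_male / w_female    # -AveSP2 / w1
    return from_female, from_male


estimate_from_female, estimate_from_male = one_way_gender_dif(-0.127, 0.154, 0.5468)
```

The two estimates differ slightly from each other and from the one-way value of -0.288, which is expected because equation (7) rests on the assumption that the average DIF in the total population is exactly zero.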



Prediction Based on the Standardization Approach 

It was hypothesized that the effects of item deletion on scores could be predicted by the 
standardization DIF detection method. To be specific, the deletion of a negative DIF item should 
benefit the focal group whereas the deletion of a positive DIF item should benefit the reference 
group. In order to test if the standardization method can indeed predict DIF effects of item 
deletion on scores, correlation analyses were conducted between SSDs for each subgroup after 
each item deletion scheme and their formula score DIF effects. 
Table 14 

Correlation Between Scaled Score Differences and STD FS-DIF Indices 

Item Deleted     Correlation 
#1               -0.972 
#11              -0.944 
#16              -0.963 


Table 14 shows that the correlations were strongly negative for all three item deletions, 
indicating a strong negative relationship between SSDs and the two-way STD FS-DIF indices. 
When SSD increases (i.e., the focal group benefits from the item deletion), 
the two-way STD FS-DIF value for that item is negative. When SSD decreases (i.e., the focal group 
is disadvantaged by the item deletion), the two-way STD FS-DIF value for that item is positive. 
All other deletions of an item with negative DIF result in a positive change in scaled scores. 

Linear regression analyses were performed between SSDs following the removal of each 
item and two-way STD FS-DIF values (the predictor variable). Scatter plots between mean 
SSDs and the two-way DIF indices after removing Items #1, #11, and #16 are shown in Figures 
1-3. Numbers 1 through 10 in the scatter plots represent group membership, where 1=African 
American Females, 2=All Others Females, 3=Asian Females, 4=Hispanic Females, 5=White 
Females, 6=African American Males, 7=All Others Males, 8=Asian Males, 9=Hispanic Males, 
and 10=White Males. Corresponding regression equations appear below each figure. It should 
be stated that these regression equations are included because they describe this small 
sample; they have little generalizability. 



Figure 1 

Scatter Plot of Regression Analysis after Removing Item #1 




The regression equation is given by SSD_Item1 = 0.048 - 1.344 * FS STD DIF_Item1 
Figure 2 

Scatter Plot of Regression Analysis after Removing Item #11 




The regression equation is given by SSD_Item11 = -0.040 - 1.465 * FS STD DIF_Item11 
Figure 3 

Scatter Plot of Regression Analysis after Removing Item #16 




The regression equation is given by SSD_Item16 = -0.016 - 1.139 * FS STD DIF_Item16 



These results show that there is a strong negative relationship between the SSDs and the 
two-way FS STD-DIF indices. Each increase of 0.10 in FS STD-DIF index is accompanied by a 
scaled score reduction of less than one whole scaled score point on the 20 to 80 point scale, a 
small but noticeable shift nonetheless. 
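
To make the size of this shift concrete, the sketch below evaluates the three fitted equations reported under Figures 1-3. The dictionary and function names are hypothetical, and because the equations are descriptive of this small sample, the numbers are illustrations rather than predictions for other data.

```python
# Illustrative sketch: evaluating the three regression equations reported above.
# Intercepts and slopes are copied from the equations under Figures 1-3;
# all names here are hypothetical.

REGRESSIONS = {
    "#1": (0.048, -1.344),
    "#11": (-0.040, -1.465),
    "#16": (-0.016, -1.139),
}

def predicted_ssd(item, fs_std_dif):
    """Fitted scaled score difference for a subgroup with the given two-way DIF value."""
    intercept, slope = REGRESSIONS[item]
    return intercept + slope * fs_std_dif

def ssd_change(item, dif_increase=0.10):
    """Change in fitted SSD when the DIF index rises by `dif_increase`."""
    _, slope = REGRESSIONS[item]
    return slope * dif_increase

changes = {item: ssd_change(item) for item in REGRESSIONS}
for item, change in changes.items():
    print(f"Item {item}: SSD change per 0.10 DIF increase = {change:.3f}")

# Example: the two-way value of -0.202 quoted in Example 1 for Item #1.
example_ssd = predicted_ssd("#1", -0.202)
print(f"Fitted SSD for Item #1 at DIF = -0.202: {example_ssd:.3f}")
```

For all three items the change per 0.10 increase in the DIF index is between roughly -0.11 and -0.15 scaled score points, well under one point.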
Discussions 

This research has shown that the act of deleting large-DIF items from an assessment 
instrument can differentially affect subgroup-level performance. In this research, the reference 
group was defined to be the total group while each of the subgroups independently acted as a 
focal group (the Dissection DIF method). Since different DIF effects exist in each subgroup, it is 
believed that using a combination of all groups as the reference group produces more accurate, 
though potentially less stable, findings than using a simple majority group approach. 

As we hypothesized, the deletion of DIF items disadvantageous to a particular group has 
been shown to affect that group the most. Scaled score differences after item deletion and 
re-equating did vary among subgroups depending on the DIF effects. Those groups found to be 
disadvantaged via the two-way DIF approaches when all three items were deleted gained points, 
whereas those thought to be advantaged lost points. In particular, African American females 
gained the most when all three items were deleted, which was consistent with the fact that they 
were disadvantaged on all those items. However, the gained and lost points amounted to less than 
one scaled score point on a 20 to 80 point scale. 

We also hypothesized that the effects of item deletion on scores can be predicted based 
on the standardization method. Regression analyses confirmed that the standardization DIF 
method can reliably predict score changes. It was shown that deleting a negative DIF item 
benefits the focal group, whereas deleting a positive DIF item benefits the reference 
group. However, the sample sizes were not large enough to generalize the findings. 

The purpose of using the dissection classification scheme within the context of a two-way 
procedure is to examine gender by ethnicity interactions that traditional DIF grouping methods, 
i.e., one-way methods, do not allow. The dissection classification method places everyone in the 
reference group simultaneously. In terms of gender by ethnicity interaction, these results show 
that among the three item deletion scenarios, the two-way DIF effects showed very little 
interaction except in one case: the Asian group. Future work will investigate the nature of this 
interaction. 

The dissection and two-way DIF method may benefit large-scale standardized testing 
programs. This more informative approach to DIF analysis not only confirms findings from the 
traditional (one-way) DIF approach but also enhances our understanding of the behavior of DIF 
items. We have shown that the act of deleting a large DIF item can (and does) have differential 
impact at the subgroup level. DIF detection procedures done via a two-way approach can offer 
valuable help to the decision-making process, especially when determining impact due to item 
deletion prior to score reporting. In addition, it was shown that by summing weighted two-way 
DIF values for each ethnic subgroup, one-way DIF indices for ethnic groups can be obtained. 
The one-way DIF values for gender groups can be derived by entering two-way DIF values of 
gender subgroups into weighted equations. Additional information can be obtained by looking at 
the scaled score changes at the subgroup level and proactively surveying to what extent the most 
disadvantaged groups may be affected. 

One way to understand the one-way and two-way DIF methods is through an analogy with 
analysis of variance (ANOVA). Conducting a one-way DIF analysis is similar to 
conducting a one-way ANOVA, where each ethnic/racial group and gender group functions as a 
main effect. In contrast, a two-way DIF analysis is similar to a two-way ANOVA, where 
information regarding interactions is available. 
Limitations 

A limitation of this study is the small sample sizes for ethnic/racial subgroups. In cases 
where small samples are used for analyses, the standardization method might produce unstable 
DIF estimates and prevent generalization of the results. A possible follow-up study to this 
research could be to apply kernel smoothing, a process currently used in the ETS comprehensive 
statistical analysis system GENASYS. This process is usually reserved for total group analyses 
only. One possibility is to investigate using kernel smoothing on small samples so as to facilitate 
subgroup DIF analyses. Another issue worth investigating is to obtain predicted scaled scores on 
the shortened tests by applying the full test local linear approximation (Dorans, 1984) and then 
compare them with the observed scaled score values for each focal group. 

The substantive DIF findings obtained in this study should be interpreted cautiously. 
First, the final forms of the SAT rarely contain DIF items because of the rigorous and proactive 
screening of pretest items. Second, the scaled scores used in this study were based on a single 
pretest, which is less than half the length of actual tests (35 items vs. 78 items). The observed 
effects on this pretest resulted from the artificial circumstances associated with using a 35-item 
pretest to produce a test score. Dropping one item from a 78-item test affects scores much less 
than dropping one item from a 35-item test. It should be stated that we examined 60 pretests for 
DIF results before finding a pretest that had enough C items to adequately illustrate the 
dissection DIF approach. 
Notes 

This research was funded by The College Board. The authors wish to express their 
thanks to the College Board for allowing them to use SAT I: Verbal data for this research. The 
opinions expressed herein are those of the authors and should not be interpreted as ETS or 
College Board policy statements. 